The steps involved in extracting usable features are: map each Entrez ID to its Gene Ontology annotations, pick a set of informative terms, and build feature vectors for pairs of proteins indicating which terms they share. To do this we must parse the Gene Ontology OBO file, gene_ontology.1_2.obo, with goatools.
Alternatively, it looks like these terms are already present in the gene2go
file retrieved from the NCBI FTP site, meaning that we wouldn't need to use goatools:
In [1]:
cd ../../geneconversion/
In [2]:
!head gene2go
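For reference, the gene2go file from the NCBI FTP site is tab-separated and (assuming the standard NCBI layout, which the head output above should confirm) the columns are tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed and Category; this is why the code below indexes line[1] (the Entrez GeneID) and line[2] (the GO ID). A minimal sketch, not part of the original session, that labels the first data row:
import csv
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()  # skip the header line
row = c.next()  # first data row
columns = ["tax_id", "GeneID", "GO_ID", "Evidence",
           "Qualifier", "GO_term", "PubMed", "Category"]
for name, value in zip(columns, row):
    print "%s: %s" % (name, value)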
However, this is not certain, and goatools seems to produce fairly good results, so we use goatools for now:
In [3]:
import csv
In [28]:
from pylab import *
In [4]:
c = csv.reader(open("gene2go"), delimiter="\t")
#skip first line
c.next()
#initialise dictionary
goEntrezdict = {}
for line in c:
    #on every line use the Entrez ID as a key and initialise a dictionary
    goEntrezdict[line[1]] = {}
Then iterating through the file again we can add entries to this dictionary, each of which is a dictionary containing empty lists indexed by GO IDs:
In [5]:
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
for line in c:
    goEntrezdict[line[1]][line[2]] = []
In [6]:
cd ../geneontology/
In [7]:
import goatools
In [8]:
parsedobo = goatools.obo_parser.GODag('gene_ontology.1_2.obo')
In [9]:
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        goEntrezdict[EntrezID][goID] = parsedobo[goID].name
Looking at an example entry:
In [10]:
print goEntrezdict[goEntrezdict.keys()[0]]
We are looking for annotations that cross over between pairs of proteins. To do this we need to pick a set of annotations that we can test each pair of Entrez IDs for. These should be annotations that are likely to occur in pairs; any annotation that occurs in isolation in our set will be useless.
So we have to count the occurrences of every term in our dictionary of dictionaries. Start by flattening it into a list:
In [11]:
gotermlist = []
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        gotermlist.append(goEntrezdict[EntrezID][goID])
Then count duplicates in the list using a dictionary:
In [12]:
dupdict = {}
#initialise counters
for k in gotermlist:
    dupdict[k] = 0
#count
for k in gotermlist:
    dupdict[k] += 1
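As an aside, the same counting can be done more concisely with collections.Counter from the standard library; a minimal sketch, equivalent to the two loops above (ties may be ordered differently than the heapq.nlargest call below):
from collections import Counter
dupdict = Counter(gotermlist)  # maps each term to its number of occurrences
for term, count in dupdict.most_common(10):
    print "%s occurred %i times" % (term, count)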
In [13]:
import heapq
In [14]:
#find term that is repeated most often
print "Terms that occurred most often:"
for k in heapq.nlargest(10, dupdict, key=lambda x: dupdict[x]):
    print "\t"+"%s occurred %i times"%(k,dupdict[k])
The three most common 'terms' are not terms at all; they are the three GO domains (biological_process, molecular_function and cellular_component). There must be a problem with goatools, or with the database file. The rest of the most common terms are all localisations.
Since there are three domains in the Gene Ontology database, we could quite easily split the count above by domain. This was also done by Qi when using the Gene Ontology as features.
Repeating the above for each domain:
In [15]:
cd ../geneconversion/
In [16]:
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
for line in c:
    goEntrezdict[line[1]][line[2]] = {}
In [17]:
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        goEntrezdict[EntrezID][goID][parsedobo[goID].namespace] = parsedobo[goID].name
In [18]:
print goEntrezdict[goEntrezdict.keys()[4]]
In [19]:
gotermdict = {}
#initialise dictionaries
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        for domain in goEntrezdict[EntrezID][goID].keys():
            gotermdict[domain] = {}
#initialise counters in dictionaries
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        for domain in goEntrezdict[EntrezID][goID].keys():
            term = goEntrezdict[EntrezID][goID][domain]
            gotermdict[domain][term] = 0
#count
for EntrezID in goEntrezdict.keys():
    for goID in goEntrezdict[EntrezID].keys():
        for domain in goEntrezdict[EntrezID][goID].keys():
            term = goEntrezdict[EntrezID][goID][domain]
            gotermdict[domain][term] += 1
In [40]:
for k in gotermdict.keys():
    print "Domain %s:"%k
    for ke in heapq.nlargest(10, gotermdict[k], key=lambda x: gotermdict[k][x]):
        print "\t"+"%s occurred %i times"%(ke,gotermdict[k][ke])
The question is: which terms should we pick to be effective features? Obviously we want terms with a lot of crossover, so that the feature actually varies between pairs. But we also want to make sure the terms will be useful on the proteins we will eventually apply the classifier to. The terms with the largest crossover on the bait and prey proteins from the pulldown experiments, taken from all three domains, would probably be the best choice.
Retrieving the bait and prey protein lists and building this list of terms to match:
In [26]:
cd ../forGAVIN/pulldown_data/BAITS/
In [29]:
baitids = list(flatten(csv.reader(open("baits_entrez_ids.csv"), delimiter="\t")))
In [30]:
cd ../PREYS/
In [31]:
preyids = list(flatten(csv.reader(open("prey_entrez_ids.csv"), delimiter="\t")))
In [32]:
cd ../../../geneconversion/
Initialising new dictionary for just these Entrez IDs:
In [33]:
baitpreyEntrezdict = {}
#initialise dictionaries in this dictionary
for i in baitids+preyids:
    baitpreyEntrezdict[i] = {}
#iterate through gene2go file
c = csv.reader(open("gene2go"), delimiter="\t")
#skip first line
c.next()
for line in c:
    try:
        baitpreyEntrezdict[line[1]][line[2]] = {}
    except KeyError:
        #ignore Entrez IDs that don't match
        pass
#use goatools to retrieve domains and terms using IDs
#building dictionary with domains as keys this time
for entrezID in baitpreyEntrezdict.keys():
    goIDs = baitpreyEntrezdict[entrezID].keys()
    #reinitialise the dictionary
    baitpreyEntrezdict[entrezID] = {}
    for goID in goIDs:
        baitpreyEntrezdict[entrezID][parsedobo[goID].namespace] = []
    for goID in goIDs:
        baitpreyEntrezdict[entrezID][parsedobo[goID].namespace].append(parsedobo[goID].name)
In [34]:
print baitpreyEntrezdict[baitpreyEntrezdict.keys()[0]]
In [35]:
domains = ['molecular_function','cellular_component','biological_process']
In [36]:
#iterate over domains and count the occurrence of terms:
gotermdict = {}
for domain in domains:
    gotermdict[domain] = {}
#initialise counters
for entrezID in baitpreyEntrezdict.keys():
    for domain in baitpreyEntrezdict[entrezID].keys():
        for term in baitpreyEntrezdict[entrezID][domain]:
            gotermdict[domain][term] = 0
#count
for entrezID in baitpreyEntrezdict.keys():
    for domain in baitpreyEntrezdict[entrezID].keys():
        for term in baitpreyEntrezdict[entrezID][domain]:
            gotermdict[domain][term] += 1
In [37]:
for k in gotermdict.keys():
    print "Domain %s:"%k
    for ke in heapq.nlargest(10, gotermdict[k], key=lambda x: gotermdict[k][x]):
        print "\t"+"%s occurred %i times"%(ke,gotermdict[k][ke])
Saving the 30 most common terms from each of these domains to files in the geneontology directory:
In [38]:
cd ../geneontology/
In [41]:
!git annex unlock molecular_function.baitprey.term
!git annex unlock biological_process.baitprey.term
!git annex unlock cellular_component.baitprey.term
In [42]:
for k in gotermdict.keys():
    f = open("%s.baitprey.term"%k, "w")
    for ke in heapq.nlargest(30, gotermdict[k], key=lambda x: gotermdict[k][x]):
        f.write("%s,%i"%(ke,gotermdict[k][ke])+"\n")
    f.close()
In [43]:
ls
In [44]:
!head biological_process.baitprey.term
We intend to produce files of the following form, which can be used in conjunction with the ocbio.extract
module:

| Protein 1 | Protein 2 | GO molecular function feature 1 | ... | GO biological process feature 90 |
|---|---|---|---|---|
| 1234 | 3214 | 0 | ... | 1 |
| ... | ... | ... | ... | ... |

An entry is 1 if both proteins share that Gene Ontology term, and 0 otherwise.
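To make the encoding concrete, here is a minimal sketch of how a single entry is decided for one term and one pair of proteins; share_term is a hypothetical helper and the term names are placeholders:
def share_term(terms_p1, terms_p2, term):
    '''terms_p1 and terms_p2 are the sets of GO term names annotated to each protein.'''
    return int(term in terms_p1 and term in terms_p2)
print share_term({"protein binding", "ATP binding"}, {"protein binding"}, "protein binding")  # 1
print share_term({"ATP binding"}, {"protein binding"}, "protein binding")                     # 0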
To build this file we must iterate through all combinations of all the Entrez identifiers in the gene2go
file:
In [54]:
cd ../geneconversion/
In [55]:
#build the set of human Entrez IDs present in the gene2go file
c = csv.reader(open("gene2go"), delimiter="\t")
c.next()
#take the second column (the Entrez GeneID) and make it a set
EntrezIDs = []
for line in c:
    #only human IDs (taxonomy ID 9606)
    if line[0] == "9606":
        EntrezIDs.append(line[1])
EntrezIDs = set(EntrezIDs)
In [56]:
#build a dictionary as above for this list of Entrez IDs
goEntrezdict = {}
for i in EntrezIDs:
    goEntrezdict[i] = {}
#iterate through gene2go file
c = csv.reader(open("gene2go"), delimiter="\t")
#skip first line
c.next()
for line in c:
    try:
        goEntrezdict[line[1]][line[2]] = {}
    except KeyError:
        #ignore Entrez IDs that don't match
        pass
#use goatools to retrieve domains and terms using IDs
#building dictionary with domains as keys this time
for entrezID in goEntrezdict.keys():
    goIDs = goEntrezdict[entrezID].keys()
    #reinitialise the dictionary
    goEntrezdict[entrezID] = {}
    for goID in goIDs:
        goEntrezdict[entrezID][parsedobo[goID].namespace] = []
    for goID in goIDs:
        goEntrezdict[entrezID][parsedobo[goID].namespace].append(parsedobo[goID].name)
In [57]:
#get the lists of terms we're trying to match to:
terms = {}
for k in gotermdict.keys():
    terms[k] = []
    for ke in heapq.nlargest(30, gotermdict[k], key=lambda x: gotermdict[k][x]):
        terms[k].append(ke)
In [13]:
import itertools
Moving to an external drive to avoid filling up my main hard drive:
In [1]:
cd /mnt/external/remotes/
In [55]:
mkdir geneontology
In [2]:
cd geneontology/
In [57]:
#write header to file
c = csv.writer(open("geneontology.flat.pairs.txt", "w"),delimiter="\t")
line = ["Protein 1", "Protein 2"]
for k in terms.keys():
for term in terms[k]:
line.append("%s:%s"%(k,term))
c.writerow(line)
In [40]:
import scipy.misc
In [ ]:
#how long will this take?
npairs = scipy.misc.comb(len(EntrezIDs),2)
print "Number of lines to write to file %i"%npairs
lcount = 0
#first build vectors for each Entrez ID:
govectordict = {}
for entrezID in EntrezIDs:
    vec = []
    for domain in domains:
        for term in terms[domain]:
            try:
                if term in goEntrezdict[entrezID][domain]:
                    vec.append(1)
                else:
                    vec.append(0)
            except KeyError:
                vec.append(0)
    #save to dictionary
    govectordict[entrezID] = array(vec[:])
#iterate over combinations of Entrez pairs
for pair in itertools.combinations(EntrezIDs,2):
    #building the line as it goes
    line = list(pair) + list(govectordict[pair[0]]*govectordict[pair[1]])
    if lcount%1000000 == 0:
        print "%.4f"%(100.0*lcount/npairs)+"% lines written"
    lcount += 1
    #write this new line to file
    c.writerow(line)
Leaving the above script to run overnight produced a 14 GB file, which is too large to count with wc -l
in a reasonable time.
Instead, [this script][qwc] was used to estimate the number of lines.
The estimate should be reasonably accurate in this case because the structure of the file is quite regular:
In [3]:
!~/qwc.sh geneontology.flat.pairs.txt
This is a little less than half the full line count computed above.
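The qwc script itself is not reproduced here, but the idea behind such an estimate is presumably to divide the file size by the average line length of a small sample; a minimal, hypothetical sketch of that approach (not the actual script):
import os
def approx_line_count(path, sample=1000):
    '''Estimate line count as file size divided by the mean length of the first `sample` lines.'''
    total = 0
    for i, line in enumerate(open(path)):
        if i >= sample:
            break
        total += len(line)
    mean = float(total) / min(i + 1, sample)
    return int(os.path.getsize(path) / mean)
print approx_line_count("geneontology.flat.pairs.txt")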
Alternatively, we could avoid using the flat file altogether and save it to a database immediately. Attempting to write such a database to see how long that will take:
In [4]:
import sys
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
In [5]:
import ocbio.extract
In [58]:
#how long will this take?
npairs = scipy.misc.comb(len(EntrezIDs),2)
print "Number of lines to write to file %i"%npairs
lcount = 0
#first build vectors for each Entrez ID:
govectordict = {}
for entrezID in EntrezIDs:
    vec = []
    for domain in domains:
        for term in terms[domain]:
            try:
                if term in goEntrezdict[entrezID][domain]:
                    vec.append(1)
                else:
                    vec.append(0)
            except KeyError:
                vec.append(0)
    #save to dictionary
    govectordict[entrezID] = array(vec[:])
#open database
db = ocbio.extract.openpairshelf("geneontology.pairs.db")
#iterate over combinations of Entrez pairs
for pair in itertools.combinations(EntrezIDs,2):
    #building the line as it goes
    line = list(govectordict[pair[0]]*govectordict[pair[1]])
    if lcount%100000 == 0:
        print "%.4f"%(100.0*lcount/npairs)+"% lines written"
    lcount += 1
    #write this new line to database
    db[frozenset(pair)] = line
db.close()
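To check the database afterwards, a feature vector can be read back using the same frozenset key; a short sketch using only the calls shown above (the pair chosen here is arbitrary):
db = ocbio.extract.openpairshelf("geneontology.pairs.db")
somepair = list(EntrezIDs)[:2]
print db[frozenset(somepair)]
db.close()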
It seems like the efficient way to do this would be to use a generator, and scikit-learn would probably accept one. Interestingly, since shelve and pickle can store many Python objects, it might even be possible to store the generator itself in a pickle file.
Then the final 'file' we want to generate would itself be a generator that produces feature vectors corresponding to a list of protein pairs.
This could be added as an option to the ocbio.extract.ProteinParser
class, but it would require some adjustment to the system: the parsers themselves could be stored and then used to index protein pairs.
To build a generator function from the above code we just need to take the variables used when writing the file above as arguments:
In [78]:
def genGOfvector(pairs, goEntrezdict, terms):
    '''Generator function for creating Gene Ontology feature vectors.
    Takes as input arguments:
        * a list of protein pairs to map (tuples or frozensets of Entrez IDs)
        * a dictionary mapping Entrez IDs to dictionaries of GO terms keyed by domain
        * a dictionary of the terms we are interested in, keyed by domain
    Yields the corresponding feature vectors.'''
    # the three GO domains are assumed to be fixed
    domains = ['molecular_function','cellular_component','biological_process']
    # get a flattened set of Entrez IDs:
    EntrezIDs = set(flatten(map(list,pairs)))
    # initialise the vectors we're going to be using:
    govectordict = {}
    for entrezID in EntrezIDs:
        vec = []
        for domain in domains:
            for term in terms[domain]:
                try:
                    if term in goEntrezdict[entrezID][domain]:
                        vec.append(1)
                    else:
                        vec.append(0)
                except KeyError:
                    vec.append(0)
        #save to dictionary
        govectordict[entrezID] = array(vec[:])
    #iterate over the protein pairs
    for pair in pairs:
        #convert to a list so the pair can be indexed even if it is a frozenset
        pair = list(pair)
        #build the feature vector for this pair
        line = list(govectordict[pair[0]]*govectordict[pair[1]])
        #yield this feature vector
        yield line
In [87]:
testpairs = itertools.combinations(list(EntrezIDs)[0:10],2)
fvectors = genGOfvector(list(testpairs),goEntrezdict,terms)
In [63]:
import pickle
In [93]:
f = open("testgen.pickle","wb")
pickle.dump(fvectors,f)
f.close()
Pickling fails here because generator objects cannot be pickled. There are two ways around this: we could write a pseudo-generator class, instantiate it and pickle the instance, or, since we really only want to generate a single feature vector at a time, we could make it a plain function:
In [81]:
def genGOfvector(pair, goEntrezdict, terms):
    '''Create the Gene Ontology feature vector for a single protein pair.
    Takes as input arguments:
        * a pair of proteins to map (a tuple or frozenset of two Entrez IDs)
        * a dictionary mapping Entrez IDs to dictionaries of GO terms keyed by domain
        * a dictionary of the terms we are interested in, keyed by domain
    Returns the corresponding feature vector.'''
    # the three GO domains are assumed to be fixed
    domains = ['molecular_function','cellular_component','biological_process']
    # initialise the vectors we're going to be using:
    govectordict = {}
    for entrezID in pair:
        vec = []
        for domain in domains:
            for term in terms[domain]:
                try:
                    if term in goEntrezdict[entrezID][domain]:
                        vec.append(1)
                    else:
                        vec.append(0)
                except KeyError:
                    vec.append(0)
        #save to dictionary
        govectordict[entrezID] = array(vec[:])
    #convert to a list so the pair can be indexed even if it is a frozenset
    pair = list(pair)
    #build the feature vector for this pair
    line = list(govectordict[pair[0]]*govectordict[pair[1]])
    #return this feature vector
    return line
In [61]:
testpairs = itertools.combinations(list(EntrezIDs)[0:10],2)
testpair = list(testpairs)[1]
fvector = genGOfvector(testpair,goEntrezdict,terms)
In [98]:
#then try to pickle this
f = open("testgen.pickle","wb")
pickle.dump(genGOfvector,f)
f.close()
Unfortunately, this function stores no state, so we would still have to pickle goEntrezdict and terms separately.
It is easier to create a class with its own __getitem__
method, which can be instantiated with the required data and then pickled.
The pickled instance can then be handed straight to the parsing code; all that is required is that a definition of the class is available.
In [88]:
import numpy as np
class GOfvectors():
    '''Maps a pair of Entrez IDs (as a frozenset key) to a Gene Ontology feature vector.'''
    def __init__(self,goEntrezdict,terms):
        self.goEntrezdict = goEntrezdict
        self.terms = terms
        return None
    def __getitem__(self,key):
        pair = list(key)
        #a frozenset of a self-interaction collapses to a single ID, so duplicate it
        if len(pair)==1:
            pair = pair*2
        #the three GO domains are assumed to be fixed
        domains = ['molecular_function','cellular_component','biological_process']
        #initialise the vectors we're going to be using:
        govectordict = {}
        for entrezID in pair:
            vec = []
            for domain in domains:
                for term in self.terms[domain]:
                    try:
                        if term in self.goEntrezdict[entrezID][domain]:
                            vec.append(1)
                        else:
                            vec.append(0)
                    except KeyError:
                        vec.append(0)
            #save to dictionary
            govectordict[entrezID] = np.array(vec[:])
        #build the feature vector for this pair
        line = list(govectordict[pair[0]]*govectordict[pair[1]])
        #return this feature vector
        return line
Then save the above class to a file in ocbio; if rerunning this notebook, make sure the cell number passed to %save is correct:
In [95]:
cd ../opencast-bio/ocbio/
In [97]:
%save -f geneontology.py 88
In [101]:
sys.path.append("/home/gavin/Documents/MRes/opencast-bio/")
In [102]:
import ocbio.geneontology
In [103]:
test = ocbio.geneontology.GOfvectors(goEntrezdict,terms)
In [104]:
print test[testpair]
In [98]:
cd ../../geneontology/
In [105]:
#pickle the instance with the data in it
f = open("testgen.pickle","wb")
pickle.dump(test,f)
f.close()
In [106]:
print testpair
In [110]:
f = open("geneontology.terms.txt", "w")
c = csv.writer(f,delimiter="\t")
for k in terms.keys():
    for t in terms[k]:
        c.writerow([k,t])
f.close()
In [111]:
!cat geneontology.terms.txt
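For later reuse, the terms file written above can be read back into the same dictionary-of-lists structure; a minimal sketch, assuming the two-column tab-separated layout written two cells above:
import csv
terms = {}
for domain, term in csv.reader(open("geneontology.terms.txt"), delimiter="\t"):
    #rebuild the domain-keyed dictionary of term lists
    terms.setdefault(domain, []).append(term)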